Machine learning of electrophysiological signals for the prediction of ventricular arrhythmias: systematic review and examination of heterogeneity between studies

Summary

Background
Ventricular arrhythmia (VA) precipitating sudden cardiac death (SCD) is among the most frequent causes of death and poses a high burden on public health systems worldwide. The increasing availability of electrophysiological signals collected through conventional methods (e.g. electrocardiography (ECG)) and digital health technologies (e.g. wearable devices), in combination with novel predictive analytics using machine learning (ML) and deep learning (DL), holds potential for personalised predictions of arrhythmic events.

Methods
This systematic review and exploratory meta-analysis assesses the state-of-the-art of ML/DL models of electrophysiological signals for personalised prediction of malignant VA or SCD, and studies potential causes of bias (PROSPERO, reference: CRD42021283464). Five electronic databases were searched to identify eligible studies. Pooled estimates of the diagnostic odds ratio (DOR) and summary area under the curve (AUROC) were calculated. Meta-analyses were performed separately for studies using publicly available, ad-hoc datasets, versus targeted clinical data acquisition. Studies were scored on risk of bias using the PROBAST tool.

Findings
2194 studies were identified, of which 46 were included in the systematic review and 32 in the meta-analysis. Pooling of individual models demonstrated a summary AUROC of 0.856 (95% CI 0.755–0.909) for short-term (time-to-event up to 72 h) prediction and an AUROC of 0.876 (95% CI 0.642–0.980) for long-term prediction (time-to-event up to years). While models developed on ad-hoc sets had higher pooled performance (AUROC 0.919, 95% CI 0.867–0.952), they had a high risk of bias related to the re-use and overlap of small ad-hoc datasets, choices of ML tool and a lack of external model validation.

Interpretation
ML and DL models appear to accurately predict malignant VA and SCD. However, wide heterogeneity between studies, in part due to small ad-hoc datasets and choice of ML model, may reduce the ability to generalise and should be addressed in future studies.

Funding
This publication is part of the project DEEP RISK ICD (with project number 452019308) of the research programme Rubicon which is (partly) financed by the Dutch Research Council (NWO). This research is partly funded by the Amsterdam Cardiovascular Sciences (personal grant F.V.Y.T).


Introduction
Sudden cardiac death (SCD) and out-of-hospital cardiac arrest are often precipitated by ventricular arrhythmias (VA) and account for 400,000 deaths annually in the United States alone. 1,2 Risk stratification for SCD and malignant VA in clinical practice is currently based on left ventricular (LV) systolic dysfunction. [3][4][5] However, LV dysfunction is inadequate as the sole surrogate marker for the underlying dynamic and complex mechanisms responsible for malignant VA. 6,7 The majority of patients who suffer an out-of-hospital cardiac arrest or SCD have preserved left ventricular systolic function. 8,9 New approaches to predict VA may be enabled by a combination of artificial intelligence (AI) and the increasing availability of electrophysiological signals obtained non-invasively using body-surface electrocardiography (ECG), intracardiac devices or wearable sensors. Machine learning (ML) and deep learning (DL) facilitate detection of ECG signatures and patterns that are unrecognizable by the human eye and might indicate sub-clinical pathology. 10 This extends the traditional identification of specific, often manually extracted features analysed in isolation as predictors of malignant VA and SCD. [11][12][13][14] Over the past decade, extensive research has been conducted on the use of ML and DL to predict malignant VA and SCD, but the current state of the art is unclear. [15][16][17][18] The aim of this systematic review and meta-analysis was to critically evaluate the merits and pooled accuracy of ML and DL models that use electrophysiological signals to predict malignant VA and SCD, and to explore the sources of heterogeneity between studies.

Methods
This review was reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). The study protocol was registered on the international prospective register of systematic reviews (PROSPERO, reference number: CRD42021283464). Below, we formulate the research question according to the PICOTS system as provided by the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS). 19,20

Population
Subjects from whom electrophysiological signals were obtained for the purpose of predicting the occurrence of the outcome(s) of interest were included. Electrophysiological signals considered eligible were ECG, intracardiac device recorded electrograms (EGM), Holter ECG, signal-averaged ECGs (SAECG), cardiac stress test ECG, and electrophysiological studies. Studies investigating participants <18 years old were excluded; no other criteria regarding eligibility of the population were applied.

Index model
Supervised or semi-supervised ML models, DL models, or any combination thereof, used to predict the outcome of interest were eligible. Studies were included regardless of the type of prediction model study according to the CHARMS checklist (i.e. development studies with and without external validation, external model validation with or without model updating). 19 Studies were included only if electrophysiological signals were used as sole or primary model input.

Outcome(s)
The outcome of interest was one (or a combination) of the following outcomes: (sustained) ventricular tachycardia (VT), ventricular fibrillation (VF), sudden cardiac death (SCD), in-hospital (IHCA) or out-of-hospital cardiac arrest (OHCA), or appropriate ICD therapy (shock or antitachycardia pacing (ATP)). Both binary and time-to-event outcomes were considered eligible.

Research in context

Evidence before this study
Sudden cardiac deaths (SCD) and malignant ventricular arrhythmias (VA) represent a major public health problem globally. Although risk factors for SCD and malignant VA have been identified (e.g. a left ventricular ejection fraction ≤35%), the majority of events occur in individuals without any risk factors. Currently, there is no effective screening tool to identify individuals at risk of either SCD or malignant VA. The emergence of artificial intelligence (AI) and increasing availability of electrophysiological signals obtained non-invasively using body-surface electrocardiography (ECG), intracardiac devices and wearable sensors could facilitate personalised prediction of SCD and malignant VA. We searched the MEDLINE (Ovid), EMBASE (Ovid), Scopus, Web of Science and Cochrane Library electronic databases to identify studies published before August 2021 that developed a machine learning (ML) or deep learning (DL) model for prediction of malignant VA or SCD using electrophysiological signals. We found that the predictive performance of individual ML and DL models was generally high, and in particular ML and DL models derived from publicly available datasets had superior accuracy. However, these studies were characterised by a high risk of bias and methodological limitations that hinder their potential translation to clinical practice.

Added value of this study
This systematic review and meta-analysis examines the current state of AI-based models that use electrophysiological signals to predict SCD and malignant VA. Our systematic assessment of ML and DL models revealed important methodological limitations that could affect the potential uptake of these models. We highlighted aspects necessary for adoption of ML and DL models in clinical practice, including external model validation, targeted model deployment, explainable AI and model transparency.

Implications of all the available evidence
Predictive models developed using AI achieve high performance and enable automated and personalised predictions. However, methodological limitations have consequences for the generalisability, clinical utility and reproducibility of these models. In order for research on the intersection of medicine and AI to be relevant and useful in clinical practice, it is essential that future studies adhere to high methodological standards.

Timing and setting
The timing of predictions was at the moment of obtaining the electrophysiological signal; all prediction horizons were eligible. There were no restrictions on the setting in which the model was developed or validated.

Literature search
The MEDLINE (Ovid), EMBASE (Ovid), Scopus, Web of Science and Cochrane Library electronic databases were systematically searched to identify studies published before September 2021. Databases were searched on September 1st 2021 using the following terms: 'implantable cardioverter defibrillator', 'sudden cardiac death', 'machine learning' and 'electrocardiography'. The full search strategy is provided in the supplementary material (Supplementary Tables S1-S5). The strategy, including terms and limits, was designed in collaboration with a medical information specialist. The reference lists of relevant papers were hand-searched to identify studies potentially missed by the electronic search.

Study selection
The results from the electronic searches were imported into reference management software and deduplicated. Two review authors (M.K., B.D.) screened studies independently, with disagreements resolved through discussion or arbitration by a third reviewer (F.T.).

Risk of bias (quality) assessment
The risk of bias was assessed using PROBAST, a tool to assess the risk of bias and applicability of prediction model studies. 21 All studies were scored on risk of bias for four domains (i.e. participants, predictors, outcome, and analysis). Low overall risk of bias was assigned if each domain was scored as low risk. High overall risk of bias was assigned if at least one domain was judged to be at high risk of bias. Unclear overall risk of bias was assigned if at least one domain was judged unclear and all other domains were judged low. The risk of bias assessment was performed independently by two authors (M.K., B.D.). In cases of disagreement, both authors attempted to reach consensus. If no consensus was reached, a third reviewer (F.T.) was consulted to settle the disagreement.
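The rule for combining the four domain ratings into an overall rating can be written down compactly. The following is a minimal sketch of that rule as described in the paragraph above; the function name, the dictionary layout and the lower-case rating labels are illustrative assumptions and are not taken from the PROBAST documentation or from the authors' materials.

```python
from typing import Dict

# The four domains named in the text.
DOMAINS = ("participants", "predictors", "outcome", "analysis")
ALLOWED = ("low", "high", "unclear")


def overall_risk_of_bias(ratings: Dict[str, str]) -> str:
    """Combine per-domain ratings into an overall rating.

    Hypothetical helper: high if any domain is high, otherwise unclear
    if any domain is unclear, otherwise low.
    """
    missing = [d for d in DOMAINS if d not in ratings]
    if missing:
        raise ValueError(f"missing domain ratings: {missing}")
    values = [ratings[d] for d in DOMAINS]
    invalid = [v for v in values if v not in ALLOWED]
    if invalid:
        raise ValueError(f"unexpected ratings: {invalid}")
    if "high" in values:
        return "high"
    if "unclear" in values:
        return "unclear"
    return "low"
```

Checking for "high" before "unclear" reflects the stated rule that a single high-risk domain determines the overall rating, regardless of how the remaining domains were scored.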

Synthesis of results
First, general study characteristics, study population and baseline characteristics (including sex distribution), type of electrophysiological signals used and analytical methods (i.e. model selection, feature selection, validation techniques) were extracted. Second, we extracted study estimates of sensitivity, specificity, positive predictive value, negative predictive value, accuracy, contingency tables and c-statistic (area under the curve). If studies reported insufficient details to reconstruct contingency tables, the respective authors were contacted to provide the missing data. Data extraction was performed by two independent reviewers (M.K., B.D.). Studies were classified based on the database(s) used for model development in order to avoid overlap between studies that results from the use of publicly available datasets by multiple studies, and to reduce the potential for optimistically biased pooled performance estimates based on unrepresentative datasets. Databases that were classified as 'ad-hoc' met the following criteria:
- The dataset was publicly available and may have been made available for challenges (e.g. the PhysioNet ECG challenge 22 );
- The dataset was developed with the primary aim of cooperative analysis and the development and evaluation of proposed new algorithms;
- The dataset may have been used as data source for multiple individual studies with similar research questions, leading to overlapping study populations;
- The dataset was considered unrepresentative (i.e. the dataset has an imbalanced outcome of interest that does not reflect a clinical setting, the dataset consists of outdated data, or there is insufficient information on the origin of the data or population characteristics).

Statistics
Exploratory meta-analysis was performed to reflect on and explain variations in the predictive performance of ML and DL models. 23 Models were included in the meta-analysis if sufficient information was provided to reconstruct contingency tables consisting of true positive, false positive, true negative and false negative results based on the specificity, sensitivity, prevalence and sample size. Pooled estimates of the diagnostic odds ratio (DOR) and the area under the summary receiver operating characteristic curve (AUROC) were calculated; sensitivity and specificity were not pooled because they depend on the probability threshold. The DOR describes the odds of a positive prediction in those with the outcome relative to the odds of a positive prediction in those without the outcome. Summary receiver operating characteristic (ROC) curves were constructed based on a bivariate regression approach. 24 The 95% confidence intervals around the AUROC were calculated using parametric bootstrapping. 25 Pooled estimates of the predictive performance were calculated separately for models developed on an ad-hoc dataset (or a combination of ad-hoc datasets), taking into account the distinct differences in representativeness of these datasets. To reduce the risk of overlapping populations from ad-hoc databases between studies, the best performing model for each unique sample of subjects was selected and used to calculate pooled estimates. The I² statistic was calculated to quantify the amount of inconsistency between studies.
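To make the quantities in this paragraph concrete, the sketch below reconstructs a contingency table from sensitivity, specificity, prevalence and sample size, computes the log DOR with its approximate variance, and derives I² from Cochran's Q. It is an illustration under stated assumptions, not the authors' analysis (which was run in R with a bivariate model): the function names are hypothetical, the 0.5 continuity correction for empty cells is an assumed convention, and I² is computed here with simple fixed-effect inverse-variance weights.

```python
import math
from typing import Sequence, Tuple


def reconstruct_table(sensitivity: float, specificity: float,
                      prevalence: float, n: int) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), rounded to whole subjects."""
    positives = round(prevalence * n)
    negatives = n - positives
    tp = round(sensitivity * positives)
    tn = round(specificity * negatives)
    return tp, negatives - tn, positives - tp, tn


def log_dor(tp: float, fp: float, fn: float, tn: float,
            correction: float = 0.5) -> Tuple[float, float]:
    """Log diagnostic odds ratio and its approximate variance.

    DOR = (tp / fn) / (fp / tn). If any cell is zero, an assumed
    continuity correction is added to every cell.
    """
    if min(tp, fp, fn, tn) == 0:
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    estimate = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return estimate, variance


def i_squared(effects: Sequence[float], variances: Sequence[float]) -> float:
    """I² in percent, from Cochran's Q with inverse-variance weights."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if df <= 0 or q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100
```

For example, a hypothetical model with sensitivity 0.8 and specificity 0.9 in 1000 subjects at 10% prevalence yields the table (tp, fp, fn, tn) = (80, 90, 20, 810) and a DOR of 36. Working on the log scale is the usual choice because the sampling distribution of the log DOR is closer to normal, which is also why confidence intervals for a DOR are symmetric only after taking logarithms.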
In cases of high heterogeneity, a series of sensitivity analyses was performed to explore potential sources of heterogeneity. First, we employed a leave-one-out approach in which we excluded one study at a time, to ensure that the results were not simply due to one large study or a study with an extreme result. Second, subgroup analyses were performed to examine whether the pooled accuracy of models varied by risk of bias, sample size, region of origin and the ad-hoc dataset that was used. Publication bias was visualised using funnel plots, and Egger's test was used to test for publication bias. The trim and fill method proposed by Duval and Tweedie was used to estimate the number of studies missing from a meta-analysis and to compute the summary estimate based on the complete data. 26 A P-value of less than 0.05 was considered statistically significant. R software, version 3.6.2 (R Core Team), was used to analyse the pooled results, specifically the Meta-Analysis of Diagnostic Accuracy and the General Package for Meta-Analysis libraries. [27][28][29]

Ethics
This meta-analysis is exempt from ethics approval as data were collected and synthesised from previous studies.

Role of the funding source
The funding source had no role in the study design, data collection, data analyses, interpretation, or writing of the report.

Results
A total of 2486 studies were identified through the MEDLINE (n = 685), EMBASE (n = 1208), Scopus (n = 587) and Cochrane (n = 6) database searches. Another three studies were identified through scrutiny of reference lists of relevant studies. After deduplication, a total of 2197 studies remained. Fig. 1 displays a flow diagram of the study selection process. Frequent reasons for exclusion were: reporting on a diagnostic model instead of a predictive model (n = 92), ineligible study outcome (n = 67) and no ML or DL approach (n = 42). Ultimately, a total of 46 studies were included in this review. [15][16][17][18] Out of these 46 studies, 36 used one or more ad-hoc dataset(s) and were pooled in a separate meta-analysis. 30

Machine learning and deep learning models developed on clinically-defined datasets
The characteristics of studies are summarised in Table 1; details on the electrophysiological signals used are displayed in Supplementary Material Table S6. [15][16][17][18]38,45,53,56,62,64 Two studies used intracardiac EGMs, 15,56 seven used body surface ECG recordings 16,17,38,45,53,62,64 and one study used ventricular monophasic action potentials (MAP) as model input. 18 ECGs ranged from 10 s to 24 h in duration and differed in number of leads (1-, 3-, 7- and 12-lead) and sampling rate (125 Hz-1600 Hz). Support vector machine classifiers were implemented as prediction model in six studies, 15,17,18,56,62,64 ensemble learning methods (random forests, decision tree) in three studies 15,38,45 and an artificial neural network in one study. 53 Kwon et al. and Rogers et al. applied a deep learning model based on a convolutional neural network (CNN). 16,18 Six studies developed an ML model for short-term prediction (horizons ranging from 1 min to 72 h before the event); the other four studies used a baseline recording as input to predict the event during a follow-up period that ranged from 21 to 44 months (i.e. long-term prediction).
K-fold cross-validation and leave-one-out cross-validation were used for model validation in four studies, 17,56,62,64 whereas a hold-out test set was used in six studies. 15,16,18,38,45,53 External validation of the model was performed in two studies. 16,56 Meta-analysis was performed for eight studies; [15][16][17][18]38,53,62,64 two studies did not report sufficient information regarding the predictive performance of the model to reconstruct contingency tables. 45,56 The sensitivity and specificity of these models ranged from 0.647 to 0.929 and from 0.181 to 0.980, respectively (Supplementary Material Figs. S3 and S4). Prediction horizons differed substantially between individual studies, ranging from a time-to-event of minutes to hours (i.e. short-term) to a time-to-event of months to years (i.e. long-term). The pooled performance of five models (20,479 patients) developed for short-term prediction demonstrated a DOR of 21.45 (95% CI 11.42-40.29) and a summary AUROC of 0.856 (95% CI 0.755-0.909), with high heterogeneity (I² = 89%) between studies (Figs. 2 and 3a). Subgroup analyses for low vs. high risk of bias and sample size <500 vs. ≥500 subjects are displayed in the Supplementary Material Figs. S5 and S6. Leave-one-out sensitivity analysis showed each individual study to significantly affect the pooled estimate of the DOR (P < 0.05) (Supplementary Fig. S7). Three studies reported on a model developed to predict events at a median time-to-event of 28-44 months (cumulative 702 patients), with a pooled DOR of 21.79 (95% CI 0.52-913.46, I² = 93%) and a summary AUROC of 0.876 (95% CI 0.642-0.980). No sensitivity analyses were performed to explain heterogeneity considering the low number of studies.
Funnel plots for publication bias are displayed in Supplementary Material Fig. S8; Egger's tests showed no evidence of publication bias. The trim-and-fill method identified two additional missing studies for short-term prediction, resulting in a pooled DOR of 13.99 (95% CI 6.85-28.54), which is lower than in the original analysis. Considering the low number of studies (k < 10), this assessment may not be reliable.

Machine learning and deep learning models developed on ad-hoc datasets
The characteristics of ML and DL models developed using an ad-hoc dataset are summarised in Table 2. A total of 36 studies were included, which drew on eight different ad-hoc datasets. Detailed descriptions of the ad-hoc datasets are displayed in Table 3. The MIT-BIH SCD Holter database (SCDH) and the normal sinus rhythm database (NSRBD) were used in 27 and 28 studies, respectively. 22 The SCDH database consists of 23 24-h ECG recordings of patients who suffered a sustained ventricular tachyarrhythmia (20 patients with VF, 3 with VT). Other open datasets used were the Creighton University ventricular tachyarrhythmia database (CUDB, 6 studies), 73 the Spontaneous Ventricular Tachyarrhythmia Database (MVTDB, 2 studies), 22 the AHA Database for Evaluation of Ventricular Arrhythmia Detectors (AHADB, 2 studies), 22 the Fantasia database (1 study), 22 the Malignant Ventricular Arrhythmia Database (VFDB, 1 study) 75 and the Paroxysmal Atrial Fibrillation Prediction challenge Database (PAFDB, 1 study). 74 Supplementary Table S7 summarises the electrophysiological features that were used as input to the prediction models. Most commonly, studies used heart rate variability as model input, in particular (a combination of) features extracted from the time domain, frequency domain and time-frequency domain, and nonlinear features. Other ECG features were related to the ECG morphology, such as intervals and amplitude of the QRS complex and ventricular repolarisation features (e.g. T-wave alternans). None of the studies reported on the external validation of a prediction model.
Overall, 24 studies (344 unique patients) that reported on models developed using ad-hoc datasets provided sufficient information for meta-analysis of pooled data. The sensitivity and specificity of these models ranged from 0.750 to 1.000 and from 0.171 to 1.000, respectively (Supplementary Fig. S9). Prediction horizons ranged from a time-to-event of 20 s to 3 h. The pooled DOR of the seven best performing models for each of the (combination of) datasets was 282.04 (95% CI 62.96-1263.40) and the summary AUROC was 0.919 (95% CI 0.867-0.952) (Fig. 4).

Risk of bias assessment
The risk of bias assessment is presented in Supplementary Figs. S1 and S2. The studies that reported on a model developed using clinically-defined data were scored as low (4 studies), high (2 studies) and unclear risk of bias (4 studies). In studies that reported on model development using ad-hoc datasets, 5 studies were scored as low risk, 24 as high risk and 7 as unclear risk.

Discussion
We systematically identified and summarised ML and DL models that used electrophysiological signals to predict malignant VA and SCD, and conducted exploratory meta-analyses to explain the sources of heterogeneity. AI has the potential to extract and process features from high dimensional complex electrophysiological signals and learn complex, hidden relationships between these features and the onset of malignant VA or SCD. Overall, ML and DL models showed high predictive performance, with models developed using (a combination of) ad-hoc datasets achieving particularly excellent performance with a summary AUROC of 0.919 (95% CI 0.867-0.952). On the other hand, studies were characterised by high risk of bias and considerable heterogeneity in terms of model performance, electrophysiological signals used, sample sizes and settings. In addition, very few studies have reported on the performance of a model when tested on an external patient cohort, which is crucial for assessing its generalisation ability. It is essential for these important methodological considerations to be addressed in future studies in order for AI models to be adopted in clinical practice.

Table 3 (excerpt)
Columns: Name; Subjects included in the database; No. recordings; Type; Frequency
Name: Massachusetts Institute of Technology-Beth Israel Hospital SCD Holter database (SCDH) 22
Subjects included in the database: Recordings of subjects before SCD or sustained VT onset as well as a few seconds later. Subjects (8 female, 13 male, 2 unknown): 18 had underlying sinus rhythm (4 with intermittent pacing), 1 was continuously paced, and 4 were diagnosed with atrial fibrillation. All subjects had a sustained ventricular tachyarrhythmia and most had an actual cardiac arrest.
No. recordings: 23 (20 subjects ...)

Current barriers to clinical implementation: external validation and model deployment
The majority of research activity in the field of VA prediction using ML and DL has been undertaken in a pre-clinical setting using ad-hoc datasets. In particular, two ad-hoc datasets (SCDH and NSRBD) comprising a total of 41 patients have been used exhaustively for model development (in 27 and 28 studies, respectively). Publicly available datasets have stimulated progress in model development over the past decades by ensuring quality control and circumventing barriers such as patient consent, quality control, costs and disparate data sources. 76 Nevertheless, these ad-hoc datasets were limited in sample size and amount of electrophysiological signals, making the derived models vulnerable to overfitting. This may lead to overly optimistic estimates of model performance. Moreover, the robustness of these models may be jeopardised by the use of datasets that do not accurately represent the target population, leading to a model that fits noise in the training data rather than the underlying patterns of interest. 77 Expanding current ad-hoc datasets through the inclusion of more subjects and electrophysiological signals, and subsequently conducting external validation of derived models, is paramount for establishing their robustness, reproducibility and generalisability. 23 Second, ML and DL models could serve distinct clinical purposes (e.g. early-warning system, risk stratification, screening tool for the general population), and therefore require different integration within clinical workflows. However, in order for ML models to have a meaningful impact on clinical practice, it is critical to integrate them into medical workflows so that their impact on patients and clinicians can be assessed. The ECG AI-Guided Screening for Low Ejection Fraction (EAGLE) trial was among the first to specifically evaluate the use of an AI tool for screening of heart failure patients in an integrated, real-world workflow using ECG.
78 The EAGLE trial demonstrated that use of the AI-ECG model increased the number of low LVEF diagnoses, although only a modest increase in the use of echocardiography was observed. At present, no trial has evaluated the impact of an ML-based model for the prediction of malignant VA or SCD in clinical practice. Finally, the impact of an ML or DL model on clinical practice is largely dependent on epidemiological factors such as the pre-test probability. For example, Au-Yeung et al. performed a secondary analysis of patients implanted with an ICD in the randomised-controlled SCD-HeFT trial, using HRV features extracted from EGMs for the prediction of appropriate ICD therapy. 15 Despite the reasonable discrimination of the developed models (AUROC = 0.81), the low prevalence of appropriate ICD therapy (3.3%) still led to a disproportionate absolute number of false positive predictions. In addition, the model developed by Kwon ... Their findings showed that the arrhythmic risk was predicted by prolonged phase II repolarisation, which potentially reflects abnormal calcium handling, providing clinicians with interpretable ML predictions. In addition, considering the dynamic and complex nature of malignant VA onset, it is important for prediction models to take into account persistent substrate as well as transient triggers for arrhythmia onset. The potential of repeated electrophysiological recordings per patient instead of features measured once at baseline was assessed by Perez-Alday et al., who found differences in short-term and long-term predictive accuracy of ECG features for SCD. 80 Leveraging ML techniques for survival predictions using time-varying covariates has the potential to capture triggers for malignant VA on top of baseline predictors. 81
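The earlier point about pre-test probability in this section can be illustrated with Bayes' rule: at a prevalence of 3.3%, even a model with reasonable discrimination produces mostly false alarms. The sketch below is illustrative only. The text reports an AUROC of 0.81 and a prevalence of 3.3% but no operating point, so the sensitivity and specificity of 0.75 used in the example are hypothetical values, and the function name is an assumption.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a given operating point and prevalence."""
    tp = sensitivity * prevalence              # true positive fraction
    fn = (1 - sensitivity) * prevalence        # false negative fraction
    tn = specificity * (1 - prevalence)        # true negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # false positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical operating point (not reported in the source) evaluated at
# the 3.3% prevalence of appropriate ICD therapy mentioned in the text.
ppv, npv = predictive_values(sensitivity=0.75, specificity=0.75,
                             prevalence=0.033)
```

Under these assumed values the positive predictive value is roughly 9%, that is, about ten false positive predictions for every true positive, while the negative predictive value stays close to 99%. The same model applied in a population with a higher event rate would yield a markedly higher positive predictive value, which is why the intended deployment setting matters when judging clinical usefulness.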

Limitations
An important limitation of this systematic review was the high proportion of included studies that reported insufficient data to be included in the meta-analysis (14 studies reported insufficient data to calculate contingency tables), which could have affected the pooled summary estimates. Given the exploratory nature of the meta-analysis, the pooled estimates are provided primarily for reference and should be considered hypothesis-generating. Second, this study did not include conventional statistical methods, which impedes comparisons between AI and statistical approaches. Third, recent population-wide autopsy data published by Tseng et al. illustrated that 40% of deaths attributed to stated SCD were not sudden or unexpected, and nearly half of presumed SCDs were not arrhythmic. 82 The pooled results in this meta-analysis could be imprecise considering both SCD and malignant VA were eligible as prediction outcome.

Conclusion
Machine learning and deep learning have potential for personalised prediction of malignant ventricular arrhythmias and could provide clinicians with early-warning systems and risk-stratification tools. Despite a substantial number of studies using ML or DL models to predict malignant VA and SCD, studies were predominantly conducted using small ad-hoc datasets, lacked external validation and were in general characterised by high risk of bias. It is pivotal that future studies meet methodological standards, are derived from multi-centric clinical datasets that capture sufficient between-subject variation, and are integrated into clinical workflows in parallel with conventional care to assess their reproducibility, generalisability and utility.

Data sharing statement
All data for this systematic review and meta-analysis were obtained from published studies. Data extracted for this review will be made available upon reasonable request. For access, please email the corresponding author. The database search strategies are provided as Supplementary Material.

Declaration of interests
SN has grants or contracts from the National Institutes of Health (HL149134). AW has grants or contracts from the Dutch Heart Foundation (Predict2), a consultancy fee from LQTtherapeutics and Cydan, and participates on a Data Safety Monitoring Board or Advisory Board for the LEAP trial. RK, FT, MK, SR, BD, PC and NB have no conflicts of interest.